“Everything is related to everything else, but near things are more related than distant things.”

— First law of geography (Tobler 1970)

1

# Libraries
library(tidyverse)
library(spdep)
library(rgdal)
library(sf)
library(sp)
library(rgeos)
library(terra)
library(DT)

colours = colorRampPalette(c("#fedb71","#FCD471","#facd71","#f6bf70","#eea26f","#EA946F","#e6866e","#e2786e","#e0716e","#dd696d"))

Reading data

The necessary data for this type of analysis are:

# Read municipality data
tn <- readOGR("../data/aggregated_data_per_municipality", encoding="UTF-8")
## OGR data source with driver: ESRI Shapefile 
## Source: "G:\Il mio Drive\2nd Year\Geospatial\TrentinoSchools\data\aggregated_data_per_municipality", layer: "aggregated_data_per_municipality"
## with 166 features
## It has 16 fields
## Integer64 fields read as strings:  Pop under Pop_mat Pop_ele Pop_med Pop_sup Popolazion
# Setting the CRS
tn <- spTransform(tn, CRS("+init=epsg:4326"))

# Function to replace NAs with 0
na.zero <- function (x) {
    x[is.na(x)] <- 0
    return(x)
}

names(tn@data)[3] = "Scuole"
names(tn@data)[7] = "Media_stud_classe"
names(tn@data)[8] = "Media_stud_scuola"
names(tn@data)[9] = "Pop_under_20"
names(tn@data)[15] = "Pop_under_20_Pop_tot"
names(tn@data)[16] = "Stud_Pop_under_20"
# Replace NAs about municipalities' number of schools with 0s
tn@data$Scuole = na.zero(tn@data$Scuole)

# Plot the municipalities in Trentino
par(mar=c(0,0,0,0))
plot(tn, axes = F, border="grey")

It is worth noticing from the map of school points that many of them overlap, especially in the Adige’s Valley and most populated areas, such as Trento, Rovereto and Riva del Garda.

# Import shapefile as SpatialPointsDataFrame
schools <- readOGR("../data/Trentino/schools/schools.shp",verbose = FALSE)

# Setting the CRS
schools <- spTransform(schools, CRS("+init=epsg:4326"))

# Plot schools over Trentino's map
par(mar=c(0,0,0,0))
plot(tn, border = "grey", axes = F)
points(schools@coords, col = "#f27059", cex = 1, pch = 1)

Descriptive spatial statistics (global analysis)

Centroids

Before starting with the global analysis of Trentino municipalities, it is necessary to select some representative points for each municipality as unique reference to the spatial coordinate.

par(mar=c(0,0,0,0))
plot(tn, border="grey")
points(coordinates(tn), 
       col="red", 
       bg = "#EF798A", 
       pch = 21,  
       lwd = 1.5)

Commonly, the centroid is a good choice, but still some problems can emerge and the centroid may not fall inside the boundaries of the territory. In this particular case, this may happen with multipolygons shape, i.e. those municipalities represented by a multiple number of polygons that do not share a boundary. Some examples are the municipalities of Tione di Trento, Ronzone, Stenico, Calliano, Pellizzano and Riva del Garda. Some municipalities, as Luserna, find their centroid in another territory because of their shape.

The plot depicts all those municipalities whose centroid is outside their actual territory. As can be seen, most of them have little zones outside the main one, while others (e.g. Predaia and Luserna) just have awkward shapes.

tn_coords = coordinates(tn)
tn@data$centroid = tn_coords
colorize = c()
for(i in 1:166){
  colorize = append(colorize, point.in.polygon(tn$centroid[i,1],
                                               tn$centroid[i,2],
                                               tn@polygons[i][[1]]@Polygons[[1]]@coords[,1],
                                               tn@polygons[i][[1]]@Polygons[[1]]@coords[,2]))
}
par(mar = c(0,0,0,0))
plot(tn, col = ifelse(colorize, "white","lightgrey"), border="grey")
text(tn_coords, labels = ifelse(colorize, "" ,tn@data$Comune), cex=0.6)

Instead of recurring to coordinates(), we could use gCentroid to obtain an alternative version of centroids. For most of the cases, these two versions coincide, but for those municipalities with multiple polygons points may differ (e.g. Soraga di Fassa in the upper right part). Since it computes a sort of mean point, making the centroid being part of other territories, the rest of the notebook will relate to the previous version (i.e. coordinates), despite inaccurate in some cases.

par(mar=c(0,0,0,0))
trueCentroids = gCentroid(tn,byid=TRUE)
plot(tn, border="grey")
points(coordinates(tn),pch=1, col="blue")
points(trueCentroids,pch=2, col="red")

⚠️ Note that, instead, school dataset contains points and not polygons, therefore no problem occurs with the centroid computation. Also, since many schools have same coordinates because in the same building, the centroids will only consider unique points, reducing the dataset from 724 to 599 unique points.

K-Nearest Neighbour

Since there are various definitions of neighbourhood, we will try to explore three of them in the following sections, starting from K-nearest neighbour.

Since there is no way to choose a specific value for \(k\), we can iterate over a customized range, let’s say from 1 to 20 neighbours (i.e. schools). The following code generates an image for each value of \(k\).

# Saving schools coordinates
school_coords = coordinates(schools)
# Save frame per frame
for(i in 1:20){
    png(paste0("../viz/knn/",i,".png"),res=300, width=1000, height=1000)
    k <- knn2nb(knearneigh(school_coords, k = i, longlat=T))
    par(mar = c(0,0,0,0))
    plot(tn, border = "grey80", axis = tn, lwd=0.5)
    plot(k, school_coords, lwd=.6, col=alpha("#F27059",alpha=0.5), 
       cex = .3, add=TRUE, points=FALSE)
    dev.off()
}

⚠️ Note that through the function saveGIF()4 it would be possible to save frames as animation, lowering the resolution of the image obtained. Saving each frame will also allow users to select a customized value for \(k\) and look at how the map changes inside the website.

KNN on Trentino Schools

KNN on Trentino Schools

As \(k\) is increased, each school finds more and more neighbours, approaching also further points. If we focus on the first frame with \(k=1\), we may notice that some schools are isolated, since their closest neighbour requires long lines to be reached. An example is the municipality of Vermiglio, in the upper left part of Trentino, in the western boundary. There are in fact \(3\) schools in Vermiglio and one of them is near the boundary, far from other schools. A similar situation happens to Rabbi and Malé, to Luserna and Lavarone, but still Vermiglio is the municipality with the longest distance between schools within the local territory.

k <- knn2nb(knearneigh(school_coords, k = 1, longlat=T))
par(mar = c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(k, school_coords, lwd=.6, 
     col=alpha("#F27059",alpha=0.5), 
     cex = .3, add=TRUE, points=FALSE)
text(tn_coords, labels = ifelse(tn@data$Comune %in% c("Vermiglio","Rabbi","Malé","Luserna","Lavarone","Predaia"),tn@data$Comune, ""), cex=0.6)

On the other hand, at the extreme opposite, with \(k=20\), we obtain the network of schools in Trentino, looking at the \(20\) closest schools around each point. Trento and Rovereto are the most intertwined zones, while green areas as Adamello-Brenta Natural Park (west boundary) and Valsugana (empty zone in the right part of the map) lack in the number of schools. In fact, as can be seen by the length of lines, distances between schools tend to increase by moving from the Adige Valley to the east and west boundary.

KNN with k=20

KNN with k=20

If instead we focus on the municipalities, we can limit to a lower \(k\), since we talk about polygons with other territories at their boundary, while some schools points may result isolated with low values of \(k\). By looking at \(k=5\) plot, we may notice how all municipalities are linked to each other, despite it does not seem so in the territory above Borgo Valsugana, where Dolomites can be found.

knn1 = knn2nb(knearneigh(tn_coords, k = 1, longlat=T))
knn2 = knn2nb(knearneigh(tn_coords, k = 2, longlat=T))
knn3 = knn2nb(knearneigh(tn_coords, k = 3, longlat=T))
knn4 = knn2nb(knearneigh(tn_coords, k = 4, longlat=T))
knn5 = knn2nb(knearneigh(tn_coords, k = 5, longlat=T))

par(mar=c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
    plot(knn5, tn_coords, lwd=.6, col=alpha("#F27059",alpha=0.5), 
         cex = .3, add=TRUE, points=FALSE)

Critical cut-off

Our aim in this section will be to find the minimum threshold distance which allows all regions/points to have at least one neighbour. By setting \(k=1\) in the k-nearest neighbour, we can first compute the nearest neighbour to each school and the relative distance and then get the maximum distance among them.

knn1<- knn2nb(knearneigh(school_coords,
                         k=1,
                         longlat=T))
all.linked <- max(unlist(nbdists(knn1,
                                 school_coords,
                                 longlat=T))) 
all.linked
## [1] 9.182375

According to the results, all schools have a neighbour at at least 9.1823746 km. This implies that the cut-off distance has to be greater than it. However, notice from the following plot the distribution of school distances: the majority of them is near \(0\)km, following a long tail distribution. This may happen in big cities, as Trento and Rovereto, where there are a lot of schools and the minimum distance between them lowers.

distances = unlist(nbdists(knn1,school_coords,longlat=T))
ggplot()+
    geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
    labs(title = "Distribution of distances between schools")+
  theme_minimal()

knn1<- knn2nb(knearneigh(tn_coords,
                         k=1,
                         longlat=T))
all.linked <- max(unlist(nbdists(knn1,
                                 tn_coords,
                                 longlat=T))) 
all.linked
## [1] 11.16466

We can repeat the same computation on municipalities centroids, discovering that every municipality has a neighbour at at least 11.1646609 km, slightly greater than the previous cut-off distance, which could mean that there are multiple schools in every municipality or that they are close enough to the boundary to be neighbour of other municipalities’ schools. Analyzing the distribution of distances between municipalities, we may notice that the majority of them distances from others from \(2\) to \(4\)km.

distances = unlist(nbdists(knn1,tn_coords,longlat=T))
ggplot()+
    geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
    labs(title="Distribution of distances between schools")+
  theme_minimal()

We can try different neighbourhood definitions for different values of the cut-off distance, starting from the minimum threshold found before (i.e. \(9.18\)km).

dnb10 <- dnearneigh(school_coords, 0, 10, longlat=TRUE); dnb10
## Neighbour list object:
## Number of regions: 724 
## Number of nonzero links: 54326 
## Percentage nonzero weights: 10.36408 
## Average number of links: 75.03591
dnb15 <- dnearneigh(school_coords, 0, 15, longlat=TRUE); dnb15
## Neighbour list object:
## Number of regions: 724 
## Number of nonzero links: 93048 
## Percentage nonzero weights: 17.75129 
## Average number of links: 128.5193
dnb20 <- dnearneigh(school_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 724 
## Number of nonzero links: 140056 
## Percentage nonzero weights: 26.71927 
## Average number of links: 193.4475
dnb25 <- dnearneigh(school_coords, 0, 25, longlat=TRUE); dnb25
## Neighbour list object:
## Number of regions: 724 
## Number of nonzero links: 192010 
## Percentage nonzero weights: 36.63083 
## Average number of links: 265.2072
dnb30 <- dnearneigh(school_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 724 
## Number of nonzero links: 247346 
## Percentage nonzero weights: 47.18759 
## Average number of links: 341.6381

As the cut-off distance increases, the number of links grows rapidly. Based on the visualization, we could have stopped at 20, where nearly every school is connected to others.

plot_neighbour = function(model, coords, title){
    par(mar=c(0,0,1,0))
    plot(tn, border="grey",xlab="",ylab="",xlim=NULL)
    title(main=title, cex.main=0.8) 
    plot(model, coords, add=TRUE, col="#F27059", pch=16, lwd = 1, points=FALSE)
}

par(mfrow = c(3,2))
plot_neighbour(dnb10, school_coords, "d nearest neighbours, d = 10")
plot_neighbour(dnb15, school_coords, "d nearest neighbours, d = 15")
plot_neighbour(dnb20, school_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb25, school_coords, "d nearest neighbours, d = 25")
plot_neighbour(dnb30, school_coords, "d nearest neighbours, d = 30")

The same approach could be applied to municipalities data, obviously creating a network based on the territories around a certain area. Remembering that the cut-off threshold in this case is above \(11\), we can start with \(12\).

dnb12 <- dnearneigh(tn_coords, 0, 12, longlat=TRUE); dnb12
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 2000 
## Percentage nonzero weights: 7.257947 
## Average number of links: 12.04819
dnb16 <- dnearneigh(tn_coords, 0, 16, longlat=TRUE); dnb16
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 3314 
## Percentage nonzero weights: 12.02642 
## Average number of links: 19.96386
dnb20 <- dnearneigh(tn_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 4866 
## Percentage nonzero weights: 17.65859 
## Average number of links: 29.31325
dnb24 <- dnearneigh(tn_coords, 0, 24, longlat=TRUE); dnb24
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 6712 
## Percentage nonzero weights: 24.35767 
## Average number of links: 40.43373
dnb30 <- dnearneigh(tn_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 9652 
## Percentage nonzero weights: 35.02685 
## Average number of links: 58.14458
par(mfrow = c(3,2))
plot_neighbour(dnb12, tn_coords, "d nearest neighbours, d = 12")
plot_neighbour(dnb16, tn_coords, "d nearest neighbours, d = 16")
plot_neighbour(dnb20, tn_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb24, tn_coords, "d nearest neighbours, d = 24")
plot_neighbour(dnb30, tn_coords, "d nearest neighbours, d = 30")

Also in this case the number of connections grows rapidly, indicating how close the municipalities are between each other. Consider that the maximum distance within province of Trento between municipalities is around 120km (considering the centroids, therefore the actual distance is greater), but over the \(75\%\) of municipalities has an area below \(50km^2\), which allows them to be connected with brief distances.

# Quantiles of areas of municipalities within the Province of Trento
tn@data$area = round(area(tn)/ 1000000,3)
area = tn@data %>%
    arrange(desc(area)) %>%
    select(Comune, area) 
quantile(area$area)
##        0%       25%       50%       75%      100% 
##   1.65200  12.95100  25.57700  48.97075 199.55200
# Find max distance within the province centroids
library(geosphere)
## Warning: il pacchetto 'geosphere' è stato creato con R versione 4.1.2
diff = c()
for(i in 1:166 ){
  for (j in 1:166){
 diff = append(diff, distm(tn_coords[i,],tn_coords[j,], fun = distHaversine))
  }
}
# Maximum distance between centroids within the province of Trento
max(diff)/1000
## [1] 120.7196

Contiguity based approach

Since schools are represented as points, we will use municipalities data to connect territories with common boundary, i.e. multiple municipalities around. From the visualization it is worth noticing the spiderweb created around the municipalities on the inside of Trentino, while those more disconnected are placed on the border of the Province, especially the territories on the upper right part of the map (e.g. Canazei, San Giovanni di Fassa, Mazzin, Campitello di Fassa, Soraga di Fassa, Moena).

par(mar=c(0,0,0,0))
contnb_q <- poly2nb(tn, queen=T)
plot(tn, border="grey")
plot(contnb_q, tn_coords, add=TRUE, col="#EF798A")
points(coordinates(tn), 
       col="red", 
       bg = "#EF798A", 
       pch = 21,  
       lwd = 1.5)

⚠️ Note that there are 166 municipalities in the Province of Trento. By sorting them according to the shape area, we get that the biggest areas do not share at least one boundary, since they take place on the border or in mountainous zones; while Trento occupies a central position.

area%>%
  DT::datatable()

Spatial Weights

After the definition of neighbourhoods, we can create spatial weights matrices, one for each neighbour list previously created.

# K-nearest neighbour
knn1.list = nb2listw(knn1)
knn2.list = nb2listw(knn2)
knn3.list = nb2listw(knn3)
knn4.list = nb2listw(knn4)
knn5.list = nb2listw(knn5)
# Critical cut-off
dnb12.list = nb2listw(dnb12,style="W")
dnb16.list = nb2listw(dnb16,style="W")
dnb20.list = nb2listw(dnb30,style="W")
dnb24.list = nb2listw(dnb24,style="W")
dnb30.list = nb2listw(dnb30,style="W")
# Contiguity based approach
contnb_q.list = nb2listw(contnb_q)

# List with weights lists and their name
weights = list(
    list(knn1.list, "K-nearest neighbour (k=1)"),
    list(knn2.list, "K-nearest neighbour (k=2)"),
    list(knn3.list, "K-nearest neighbour (k=3)"),
    list(knn4.list, "K-nearest neighbour (k=4)"),
    list(knn5.list, "K-nearest neighbour (k=5)"),
    list(dnb12.list, "Critical cut-off neighbourhood (d=12)"),
    list(dnb16.list, "Critical cut-off neighbourhood (d=16)"),
    list(dnb20.list, "Critical cut-off neighbourhood (d=20)"),
    list(dnb24.list, "Critical cut-off neighbourhood (d=24)"),
    list(dnb30.list, "Critical cut-off neighbourhood (d=30)"),
    list(contnb_q.list, "Contiguity-based neighbourhoord")
)

Moran’s I test of spatial autocorrelation

In the section, we will focus on Moran’s I test of spatial autocorrelation of Trentino Schools, in particular on the number of schools and students that populate every municipality. Let’s start plotting the distribution of some features over the territory.

⚠️ Note that we do not hold data about students of all schools in Trentino, therefore some data might be missing. In these cases, the plots below will show the respective area in white.

cols = list(tn$Scuole,tn$Studenti, tn$Classi, tn$Media_stud_scuola, tn$Pop_under_20_Pop_tot, tn$Stud_Pop_under_20, tn$Media_stud_classe)
titles = c("Schools","Students","Classes","Mean of students per school","Population under 20 over Total Population", "Students over Population under 20", "Mean students per class")
colours <- c("#fedb71","#FCD471","#facd71","#f6bf70","#eea26f","#EA946F","#e6866e","#e2786e","#e0716e","#dd696d")

na.ignore = function(x){
  x[is.na(x)] <- -1
  return(x)
}

par(mfrow=c(4,2),mar = c(0,0,1.7,0))
for(i in 1:7){
    c = na.ignore(unlist(cols[i]))
    brks <- round(quantile(c, seq(0,1,0.1)), digits=3)
    plot(tn, col=ifelse(c==-1,
                        "#ffffff",
                        colours[findInterval(c, brks, all.inside=TRUE)]),
         main = titles[i], cex.main=2.5)
}

Now we can try to compute the Moran’s test based on all the previous definitions of neighborhood and with the previous features exposed, trying to find some spatial autocorrelation.

Neighbourhood = c()
Column = c()
Sd = c()
p_value = c()
Moran_I_statistic = c()
Mean = c()
Var = c()
Assumption = c()

# Iterate over columns
for(i in 1:length(cols)){
  # Iterate over neighbourhood
  for (w in weights) {
    c = na.zero(unlist(cols[i]))
    # Iterate over assumptions
    for (rand in c(T,F)) {
      Neighbourhood = append(Neighbourhood, w[[2]])
      res = moran.test(c, w[[1]], randomisation = rand)
      Column = append(Column, titles[i])
      Sd = append(Sd, round(as.numeric(res[[1]]),4))
      p_value = append(p_value, round(as.numeric(res[[2]]), 4))
      Moran_I_statistic = append(Moran_I_statistic, round(as.numeric(res[[3]][1]),4))
      Mean = append(Mean, round(as.numeric(res[[3]][2]),4))
      Var = append(Var, round(as.numeric(res[[3]][3]),4))
      if(rand) {
        Assumption = append(Assumption, "Randomization")
      }else{
        Assumption = append(Assumption, "Normality")
      }
    }
    Neighbourhood = append(Neighbourhood, w[[2]])
    res = moran.mc(c, w[[1]], nsim=999)
    Column = append(Column, titles[i])
    Sd = append(Sd,round(as.numeric(res[[1]]),4))
    p_value = append(p_value, round(res$p.value),4)
    Moran_I_statistic = append(Moran_I_statistic, round(res$statistic,4))
    Mean = append(Mean, "")
    Var = append(Var, "")
    Assumption = append(Assumption, res$method)
  }
}

# create df with results and show them
res_df = data.frame(Column, Neighbourhood, Moran_I_statistic, p_value,
                    Sd, Mean, Var, Assumption)

res_df %>%
  arrange(desc(abs(Moran_I_statistic)), p_value) %>%
  DT::datatable()

By ordering results according to the absolute value of the Moran’s I statistics, we can see that the highest statistics obtained are around the \([0.1789, 0.1952]\) interval, got in all three assumptions. Albeit mean students per school seem to be the column with highest Moran’s statistics, the p-value is not below the threshold of \(0.05\) for the majority of its observations. In fact, the mean of students per school results significative with neighbourhoods knn (k=5) and critical cut-off with d=16.

On the other hand, the proportion of Population under 20 over the total population seems significative with every configuration, except for some knns neighbourhoods. However, Moran’s statistics is lower than the ones gathered considering the mean of students per school.

Both these two columns show a positive spatial autocorrelation, while negative ones are associated to the number of Students over the Population under 20, with low p-value and minimum value \(-0.1296\) for Moran’s I statistics.

res_df%>%
  group_by(Column) %>%
  summarise("Median Moran" = median(Moran_I_statistic), "Median p-value" =  median(p_value), 
            "Mean Moran" = round(mean(Moran_I_statistic),4), "Mean p-value" = round(mean(p_value),4)) %>%
  DT::datatable()

Considering an aggregated table with median and mean statistics and p-value, we can confirm that the columns with highest statistics are:

The low mean and median p-values for Students confirms the absence of spatial autocorrelation based on the number of students. Nearly the same happens to the number of Classes and Schools.

Moran’s I test of spatial autocorrelation in OLS residuals

Residuals test with Mean of students per class

Let’s start by trying to model the mean students per class with all the remaining features we have explored. The summary shows that all predictors are significant, except for the population under 20 over total population and the number of students.

LinearMean <- lm(Media_stud_classe ~ Stud_Pop_under_20+Scuole+Classi+Media_stud_scuola+Pop_under_20_Pop_tot+Studenti, tn, na.action = na.ignore)
summary(LinearMean)
## 
## Call:
## lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Pop_under_20_Pop_tot + Studenti, 
##     data = tn, na.action = na.ignore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4646  -1.1711  -0.0065   1.2772   7.8307 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.844835   1.541773   3.142    0.002 ** 
## Stud_Pop_under_20     6.955752   0.425146  16.361  < 2e-16 ***
## Scuole                0.839792   0.133982   6.268 3.30e-09 ***
## Classi               -0.376734   0.050486  -7.462 5.24e-12 ***
## Media_stud_scuola     0.040702   0.003932  10.352  < 2e-16 ***
## Pop_under_20_Pop_tot  8.852831   8.258840   1.072    0.285    
## Studenti              0.014127   0.002546   5.549 1.18e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.213 on 159 degrees of freedom
## Multiple R-squared:  0.906,  Adjusted R-squared:  0.9024 
## F-statistic: 255.3 on 6 and 159 DF,  p-value: < 2.2e-16

With the step function it is possible to simplify this model considering only those features with high significance, excluding the features with no significance (i.e. Pop_under_20_Pop_tot).

# Searching for a simplified model where every feature has high significance
LinearMean = step(LinearMean)
## Start:  AIC=270.54
## Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + 
##     Pop_under_20_Pop_tot + Studenti
## 
##                        Df Sum of Sq     RSS    AIC
## - Pop_under_20_Pop_tot  1      5.63  784.17 269.74
## <none>                               778.54 270.54
## - Studenti              1    150.77  929.31 297.93
## - Scuole                1    192.37  970.91 305.20
## - Classi                1    272.65 1051.20 318.39
## - Media_stud_scuola     1    524.68 1303.22 354.06
## - Stud_Pop_under_20     1   1310.68 2089.22 432.41
## 
## Step:  AIC=269.74
## Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + 
##     Studenti
## 
##                     Df Sum of Sq     RSS    AIC
## <none>                            784.17 269.74
## - Studenti           1    147.24  931.41 296.30
## - Scuole             1    200.34  984.51 305.51
## - Classi             1    269.92 1054.09 316.84
## - Media_stud_scuola  1    638.28 1422.44 366.59
## - Stud_Pop_under_20  1   1306.95 2091.12 430.56
summary(LinearMean)
## 
## Call:
## lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5592  -1.1207   0.0051   1.2213   7.7849 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.442776   0.393632  16.368  < 2e-16 ***
## Stud_Pop_under_20  6.908668   0.423067  16.330  < 2e-16 ***
## Scuole             0.853246   0.133455   6.394 1.70e-09 ***
## Classi            -0.374531   0.050468  -7.421 6.46e-12 ***
## Media_stud_scuola  0.042152   0.003694  11.412  < 2e-16 ***
## Studenti           0.013921   0.002540   5.481 1.61e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.214 on 160 degrees of freedom
## Multiple R-squared:  0.9053, Adjusted R-squared:  0.9023 
## F-statistic: 305.9 on 5 and 160 DF,  p-value: < 2.2e-16

The plot of the studentized residuals of the linear model can give us a hint about the presence of spatial dependence in the residuals. In fact, some similarities may be found in the east part, near the border and within the Adamello Brenta Park.

par(mar=c(0,0,1,0))
studres <- rstudent(LinearMean)
resdistr <- round(quantile(studres, seq(0,1,0.25)), digits=3)
colours_5 <- colours(5)
plot(tn, col=colours_5[findInterval(studres, resdistr, all.inside=TRUE)],
     main = "Residuals quantiles in Trentino",
     border="grey20")

The command that allows to perform the Moran’s I test in the OLS residuals is the function lm.morantest(). In the following chunk, the test to the studentized residuals of the linear Solow model, for different specifications of the spatial weights matrix. This method will be applied to all neighbourhoods definition, except KNN with \(k\) lower than 4 and the contiguity neighbourhood, since they return unusual results.

ols_res = data.frame(Neighbourhood = c(""),
                     Moran = c(""),
                     p_value = c(""))
# Moran test on residuals
for(i in 4:11){
  t = lm.morantest(LinearMean,weights[[i]][[1]],resfun=rstudent)
  ols_res = rbind(ols_res, c(weights[[i]][[2]],
                             t$estimate['Observed Moran I'],
                             t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
  arrange(desc(abs(Moran))) %>%
  filter(!is.na(Moran)) %>%
  DT::datatable()

The obtained results provide a negative spatial autocorrelation, but the p-value is far from being below to the threshold to confirm this hypothesis. Therefore, no spatial autocorrelation is found in the residuals of this model.

Residuals test with Pop under 20 over total population

By focusing instead on the population under 20 over the total population for each municipality, we obtain from the summary that only the mean students per school is highly significative, while the students over population under 20’s p-value is slightly above \(0.05\).

LinearPop <- lm(tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20+Scuole+Classi+Media_stud_scuola+Media_stud_classe+Studenti, tn)
summary(LinearPop)
## 
## Call:
## lm(formula = tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Media_stud_classe + Studenti, 
##     data = tn)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.059060 -0.011379  0.002452  0.010725  0.045638 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.778e-01  9.719e-03  18.296   <2e-16 ***
## Stud_Pop_under_20 -1.154e-02  6.268e-03  -1.842   0.0679 .  
## Scuole             9.683e-04  1.315e-03   0.736   0.4629    
## Classi             5.377e-04  4.986e-04   1.078   0.2829    
## Media_stud_scuola  1.387e-04  4.950e-05   2.802   0.0059 ** 
## Media_stud_classe  5.846e-04  9.422e-04   0.620   0.5361    
## Studenti          -3.461e-05  2.351e-05  -1.472   0.1435    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01878 on 124 degrees of freedom
##   (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.184,  Adjusted R-squared:  0.1445 
## F-statistic: 4.659 on 6 and 124 DF,  p-value: 0.0002638

By applying the step function, the mean of students per class and the number of classes are erased. However, the adjusted r-squared is too low to guarantee the quality of this model.

# Searching for a simplified model where every feature has high significance
LinearPop = step(LinearPop)
## Start:  AIC=-1034.61
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Classi + 
##     Media_stud_scuola + Media_stud_classe + Studenti
## 
##                     Df  Sum of Sq      RSS     AIC
## - Media_stud_classe  1 0.00013579 0.043877 -1036.2
## - Scuole             1 0.00019128 0.043933 -1036.0
## - Classi             1 0.00041026 0.044152 -1035.4
## <none>                            0.043742 -1034.6
## - Studenti           1 0.00076467 0.044506 -1034.3
## - Stud_Pop_under_20  1 0.00119657 0.044938 -1033.1
## - Media_stud_scuola  1 0.00276903 0.046511 -1028.6
## 
## Step:  AIC=-1036.2
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Classi + 
##     Media_stud_scuola + Studenti
## 
##                     Df Sum of Sq      RSS     AIC
## - Classi             1 0.0002938 0.044171 -1037.3
## - Scuole             1 0.0004336 0.044311 -1036.9
## - Studenti           1 0.0006498 0.044527 -1036.3
## <none>                           0.043877 -1036.2
## - Stud_Pop_under_20  1 0.0010645 0.044942 -1035.1
## - Media_stud_scuola  1 0.0094505 0.053328 -1012.6
## 
## Step:  AIC=-1037.33
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + 
##     Studenti
## 
##                     Df Sum of Sq      RSS     AIC
## <none>                           0.044171 -1037.3
## - Scuole             1 0.0007497 0.044921 -1037.1
## - Stud_Pop_under_20  1 0.0007708 0.044942 -1037.1
## - Studenti           1 0.0009575 0.045129 -1036.5
## - Media_stud_scuola  1 0.0092138 0.053385 -1014.5
summary(LinearPop)
## 
## Call:
## lm(formula = tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + 
##     Media_stud_scuola + Studenti, data = tn)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.059266 -0.011806  0.002099  0.010582  0.046054 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.826e-01  4.277e-03  42.690  < 2e-16 ***
## Stud_Pop_under_20 -7.607e-03  5.130e-03  -1.483    0.141    
## Scuole             1.649e-03  1.128e-03   1.462    0.146    
## Media_stud_scuola  1.596e-04  3.113e-05   5.127 1.08e-06 ***
## Studenti          -1.100e-05  6.658e-06  -1.653    0.101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01872 on 126 degrees of freedom
##   (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.1759, Adjusted R-squared:  0.1498 
## F-statistic: 6.725 on 4 and 126 DF,  p-value: 6.134e-05
ols_res = data.frame(Neighbourhood = c(""),
                     Moran = c(""),
                     p_value = c(""))
# Moran test on residuals
for(i in 4:11){
  t = lm.morantest(LinearPop,weights[[i]][[1]],resfun=rstudent)

  ols_res = rbind(ols_res, c(weights[[i]][[2]],
                             t$estimate['Observed Moran I'],
                             t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
  arrange(desc(abs(Moran))) %>%
  filter(!is.na(Moran)) %>%
  DT::datatable()

In this case instead, we can pay attention to the p-value, which is above \(0.05\) in all cases, except for knn with \(k=5\). This may indicate the presence of spatial autocorrelation in the residuals and consequently that the model is misspecified. Equivalent results have been obtained by rotating and discarding some predictors.

If spatial autocorrelation is present it will violate the assumption about the independence of residuals and call into question the validity of hypothesis testing6, leading to the rejection of the Population under 20 over the total population. However, since the spatial autocorrelation is found only in a single neighbourhood, it will still be studied in the next sections.

By looking at the plot we can see how the residuals in the most populated areas, as Trento and Rovereto, are low (i.e. yellow), while smaller municipalities around are red (i.e. high residuals).

par(mar=c(0,0,1,0))
studres <- rstudent(LinearPop)
resdistr <- round(quantile(studres, seq(0,1,0.25)), digits=3)
plot(tn, col=colours_5[findInterval(studres, resdistr, all.inside=TRUE)],
     main = "Residuals quantiles in Trentino")

Local Analysis

Population under 20 over Total Population

Moran’s Scatterplot on Population under 20 over Total Population

The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods. The membership of municipalities inside a quadrant or another changes according to the neighbourhood definition. For instance, Novaledo often switches from HH to HL quadrant and Valfloriana from LH to LL and viceversa. Also, since this feature is a proportion that goes from \(0.10\) to \(0.24\), many municipalities overlap or assume the same value (columns of dots). Nevertheless, some outliers are noticeable, represented with a different style and their label.

As stated forehead, the quadrants with the pink line and its confidence interval shows the municipalities with similar proportion of students population over the total one (i.e. positive spatial autocorrelation). An example is the relationship between Novaledo and Vignola-Falesina, which are usually in the same HH quadrant, have a high proportion of students and they are close, geographically talking. On the opposite side, Cinte Tesino and Castello Tesino show a low value for students over the population.

On the other hand, municipalities on the remaining quadrants, have dissimilar values if they belong to opposite quadrants. For example, Fierozzo and Frassilongo share a boundary, but their proportion of students is \(0.22\) and \(0.15\) respectively, enhancing a big dissimilarity between these two municipalities.

This chunk shows how many municipalities belong to a specific quadrant, based on the different concepts of neighbourhood. It seems like 15 or less over 166 municipalities have a spatial autocorrelation and belong to a specific quadrant. These hotspots are sequently visualized with a colour, whose association is related to the membership to a specific quadrant.

Since the membership to quadrants is divided over the three different concepts of neighbourhood, we can observe that the critical cut-off neighbourhood has no hotspot in the HH quadrant, which represents the majority of municipalities for the contiguity approach.

## quadrants.knn
##   HH   HL   LH   LL None 
##    1    3    2    5  155
## quadrants.dnb
##   HL   LH   LL None 
##    4    3    7  152
## quadrants.cont
##   HH   HL   LH   LL None 
##    6    2    3    4  151

In order to better visualize them, the plot is served, showing in white the municipalities with no quadrant and the one inside the Moran’s quadrants with a gradient. Moreover, three of them are generated, one for each notion of neighbourhood, choosing a random \(k\) or \(d\) for knn and critical cut-off. This choice is due to the fact that different neighbourhoods lead to different hotspots, i.e. the points that emerge over others in the Moran’s scatterplot.

⚠️ It is suggested to open the image in a new window to better read municipalities labels, whose size has been reduced for readability issues.

par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"), list(quadrants.dnb,"Cut-off"), list(quadrants.cont,"Contiguity"))
for(l in quadrants){
  colourization = unlist(color_mapping[l[[1]]])
  par(mar=c(0,0,1,0))
  plot(tn, col=colourization, border = "grey", main=paste0("Regions with influence on students over population under 20 (neighbourhood = ",l[[2]],")"))
  legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
         fill=unlist(color_mapping), bty="n", cex=0.8)
  text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}

As can be noticed, while critical cut-off and knn focus on the same eastern area of Trentino, the contiguity approach selects municipalities all over the province, in particular Trento and Rovereto, the most populated ones, and the least ones in the border.

While knn and contiguity show Novaledo as HH, KNN and Cut-off also show Frassilongo, Valfloriana, Castello Tesino and Cinte Tesino (and others not always highlighted before), described above as recurring ones in the not HH quadrants.

# Municipalities in common between KNN and cut-off
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Castello Tesino"  "Cinte Tesino"     "Imer"             "Mezzano"         
## [5] "Palù Del Fersina" "Ruffrè-Mendola"   "Terragnolo"       "Valfloriana"
# Novaledo in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Novaledo"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## character(0)

Local Moran’s I

Local statistics can be tested for deviations using the hypothesis of absence of local spatial autocorrelation and hence can provide the statistical significance of the local spatial patterns detected by the Moran scatterplot. In particular, the function localmoran() provides:

  • local moran statistic \(I_i\);
  • expectation of the statistic \(E(I_i)\);
  • variance of the local moran statistic \(Var(I_i)\)
  • standard deviate of the local moran statistic \(Z_i\);
  • p-value of the local moran statistic \(P(z != E(I_i))\).

By sorting results according to the municipality with highest absolute value of the local Moran statistic \(I_i\) and by filtering those whose p-value is below \(0.05\), we obtain the areas with highest statistical significance of the local spatial patterns, in a clearer way than the Moran scatterplot. Specifically, Sagron Mis, Cinte Tesino and Castello Tesino seem to be the municipalities with highest Local Moran’s, but comparing the top municipalities in lmI with those in the quadrants computed through the scatterplot, we may notice that only 41% of municipalities identified through the plot are spatially significant in the Local Moran’s.

lmI <- localmoran(tn$Pop_under_20_Pop_tot, dnb24.list, 
                  na.action = na.exclude)
lmI = data.frame(lmI) 
rownames(lmI) = tn$Comune
lmI_sign = lmI%>%
  filter(lmI$Pr.z....E.Ii..<0.05) %>%
  arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through local Moran's
sum(c(tn$Comune[quadrants.knn != "None" | 
                                 quadrants.dnb != "None" | 
                                 quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.4054054

The following two plots show:

  • the territories where the Local Moran’s values are higher (darker colours) and lower (lighter colours). The labels are present only for those municipalities whose p-value is below \(0.05\). As can be noticed, labels are present in the extreme right of the Province of Trentino (e.g. Primiero San Martino di Castrozza, Castel Tesino, Predazzo etc), but also around the most populated areas (e.g. Trento, Riva del Garda etc).
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
     border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(tn@data$Comune %in% rownames(lmI_sign), tn@data$Comune ,""), cex=0.7)

  • this second plot only highlights the municipalities whose p-value is below the threshold of \(0.05\), allowing to understand which areas have the lowest and the highest p-value and therefore the level of statistical significance of the spatial autocorrelation shown in the previous plot. Madruzzo, Garriga Terme, Tesero and Ziano di Fiemme are the municipalities with lowest p-value, but only Madruzzo has a high value for the local Moran’s \(I_i\), while others probably have a negative value for \(I_i\). Other territories around Riva del Garda and Castello Tesino have p-values around \(0.01\) and most of them positive spatial autocorrelation, except for Imer, Mezzano, Cavedine, Terragnolo, Scurele and Folgaria.
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
                 ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
                        ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
                               ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5])))) 
tn$colpval[pval>0.05] <- "white" 
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9) 
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"

plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant", 
       "p-value = 0.05", "p-value = 0.01", "p-value = 0.001", 
       "p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.7)

Comparing the results obtained with the Scatterplots’, we can state that while the critical cut-off found significance in the eastern part (e.g. Primiero, Predazzo, Castello Tesino etc), the contiguity approach helped to find significance near Riva del Garda and around Trento.

Mean of students per school

Moran’s Scatterplot

The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods, considering the mean of students per school. As in the previous case, the pink line has a positive slope, whose confidence interval continues to grow by increasing the \(x\) value. Still, some differences can be noticed from a neighbourhood to another and some municipalities, like Mezzocorona, Villa Lagarina and San Michele All’Adige, continues to switch from above to below the mean dashed line.

In HH quadrant, Trento, Rovereto, Mori, Mezzocorona, Avio, Lavis and Ala seem to be the municipalities with more students per school in the majority of plots, but also Brentonico, Giovo and Nago-Torbole, except when considering the critical cut-off neighbourhood.

On the LH part, communities like Drena (surrounded by Arco, Dro and Cavedine), Fornace (Baselga di Pinè and Civezzano), Ronzo-Chienis, Vallarsa (Ala, Rovereto) and Terragnolo (Rovereto, Folgaria) are those municipalities that have few students (less than \(100\) per school), but surrounded by municipalities with a lot of them.

This chunk shows how many municipalities belong to a specific quadrant, based on the different concepts of neighbourhood. It seems like 16 or less over 166 municipalities have a spatial autocorrelation and belong to a specific quadrant. In particular, both knn and cut-off neighbourhoods do not provide hotspots for the LL quadrant, while it exceeds with HH municipalities, compared to contiguity neighbourhood.

table(quadrants.knn)
## quadrants.knn
##   HH   HL   LH None 
##   12    1    2  151
table(quadrants.cont)
## quadrants.cont
##   HH   HL   LH   LL None 
##    3    2    6    5  150
table(quadrants.dnb)
## quadrants.dnb
##   HH   HL   LH None 
##    6    2    3  155

The map displays the membership of municipalities to each of the quadrant in the Moran’s scatterplot, while in white leaving the remaining territories with no quadrant.

Starting with KNN version, the Adige valley is completely in red, showing the spatial similarity of high number of students per school: from Mezzocorona, to Trento, continuing to the neighbourhood of Rovereto and the bottom-center part of Trentino.

The cut-off neighbourhood instead considers partially the same territories of KNN, with less focus on Rovereto and more on some LH municipalities, as Vallarsa, Trambileno and Terragnolo.

In the end, the contiguity approach detaches itself from previous approaches to show apparently random municipalities, some in common with knn and cut-off (e.g. Roverè della Luna, Lavis, Trambileno, Nago-Torbole and Mori).

par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"), list(quadrants.dnb,"Cut-off"), list(quadrants.cont,"Contiguity"))
for(l in quadrants){
  colourization = unlist(color_mapping[l[[1]]])
  par(mar=c(0,0,1,0))
  plot(tn, col=colourization, border = "grey", main=paste0("Regions with influence on mean number of students per school (neighbourhood = ",l[[2]],")"))
  legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
         fill=unlist(color_mapping), bty="n", cex=0.8)
  text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}

# Municipalities in common between KNN and cut-off
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Avio"                  "Lavis"                 "Levico Terme"         
## [4] "Mezzocorona"           "Mori"                  "San Michele All'Adige"
## [7] "Trento"                "Villa Lagarina"
# Novaledo in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Lavis"             "Mori"              "Nago-Torbole"     
## [4] "Roverè Della Luna"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## [1] "Lavis"      "Mori"       "Trambileno"

Local Moran’s I

This time, we will focus on the mean of students per school, whose values comprehend also NAs, which will be excluded. As before, we create a dataframe ordered by the absolute value of Local Moran’s \(I_i\), filtered by the p-value below \(0.05\).

lmI <- localmoran(tn$Media_stud_scuola, dnb24.list, na.action = na.exclude)
lmI = data.frame(lmI) 
rownames(lmI) = tn$Comune
lmI_sign = lmI%>%
  filter(lmI$Pr.z....E.Ii..<0.05) %>%
  arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through local Moran's
sum(c(tn$Comune[quadrants.knn != "None" | 
                                 quadrants.dnb != "None" | 
                                 quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.2926829

41 municipalities find a significant statistic for the local spatial autocorrelation and just \(29\%\) are in common with the ones found in the Moran scatterplots’ quadrants.

The following two plots show:

  • the territories where the Local Moran’s values are higher (darker colours) and lower (lighter colours). The labels are present only for those municipalities whose p-value is below \(0.05\). The white areas are those for which we do not detain students’ data. Most of the statistical significant municipalities are in the Valsugana or around the Rovereto-Riva del Garda zone.
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
     border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(tn@data$Comune %in% rownames(lmI_sign), tn@data$Comune ,""), cex=0.7)

  • however, when looking to the p-value, Calliano, Terragnolo, Pomarolo, Volano and Nomi are the areas with lowest p-value (therefore highest significance). All these municipalities can be found as Rovereto and Trento’s neighbours. However, Nomi and Terragnolo have a negative spatial autocorrelation (therefore less students per school with respect to their neighbours who have highest values), while Pomarolo and Volano have higher local moran’s \(I_i\) than Calliano.
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
                 ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
                        ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
                               ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5])))) 
tn$colpval[pval>0.05] <- "white" 
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9) 
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"

plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant", 
       "p-value = 0.05", "p-value = 0.01", "p-value = 0.001", 
       "p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.6)

These results deviate from those discovered through the scatterplot, since here two big clusters are evident, while in the plots with all neighbourhood definitions territories with spatial autocorrelation are quite sparse. Nevertheless, some territories are in common, especially in Rovereto area.

Students over Population under 20

Moran’s Scatterplot

This last inquiry is due to the great gap discovered in the choropleth map inside the website that shows the proportion of actual students over the total population under 20 years old of every municipality. It seems in fact that some municipalities host more students than those who actually live in that area, leading to the possible conclusion that students may need to move from their city to another to go to school every day.

The remaining grid shows this time a negative trend of the relationship between x and the spatially lagged x, indicating that there are few close municipalities with a high number of students over the young population, but mostly of them share dissimilar values with their neighbours. In fact, this may be due to the problem highlighted before, related to the movement of students across different municipalities to go to school, especially middle, high and professional schools.

By assigning a quadrant to each municipality we can easily access to the number and the names of municipalities where students come from other areas or that have the necessity of moving to go to school. In fact, as shown in the assignation of numbers to quadrants, the most populated areas are HL and LH, showing a trend of negative spatial autocorrelation.

table(quadrants.knn)
## quadrants.knn
##   HL   LH   LL None 
##    6    4    2  154
table(quadrants.cont)
## quadrants.cont
##   HH   HL   LH None 
##    4    7    2  153
table(quadrants.dnb)
## quadrants.dnb
##   HH   HL   LH None 
##    2    5    2  157

In this case the color mapping has been changed to highlight the quadrants with most of the municipalities, whose values are dissimilar (i.e. HL in red, LH in yellow).

As in the previous two analysis of the scatterplot, knn and cut-off show more similarities than with the contiguity approach, thus pointing out the municipalities that host more students than those who actually live inside the territory (i.e. they welcome students from their neighbours).

The most critical situation is around Tione di Trento, which has a high number of students over its under 20 population, while Valdaone, Selle Giudicarie, Porte di Rendena and Pelugo lack of students. Notice however that we do not hold data about the number of students in Valdaone and most of them are missing in those zones.

Something similar but limited happens between Canazei and Giovanni di Fassa, since their closeness and the gap of students in the first against the aboundance in the second may imply that students in Canazei move to Giovanni di Fassa to go to school.

color_mapping = list("LL" = "#F6Bf70",
                     "LH" = "#FEDB71",
                     "HL" = "#E0716E",
                     "HH" = "#E6866E",
                     "None" = "white")

par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"),
                 list(quadrants.dnb,"Cut-off"),
                 list(quadrants.cont,"Contiguity"))

for(l in quadrants){
  colourization = unlist(color_mapping[l[[1]]])
  par(mar=c(0,0,1,0))
  plot(tn, col=colourization, border = "grey", 
       main=paste0("Regions with influence on students over population under 20 (neighbourhood = ",l[[2]],")"))
  legend(x=11.38, y=45.95, 
         legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
         fill=unlist(color_mapping), bty="n", cex=0.8)
  text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}

# Municipalities in common between KNN and cut-off
# (those that for sure host students from their neighbours)
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Bondone"         "Cles"            "Ossana"          "Tione Di Trento"
# Cinte Tesino in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Cinte Tesino"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## character(0)

Local’s Moran I

As in the previous section, also here not all municipalities detain data about students over the population under 20, therefore missing values are excluded. Since there are only few municipalities whose p-value is below \(0.05\) (i.e. Molveno, Cimone, Aldeno and Storo), the dataframe below accepts all rows whose p-value is below \(0.1\).

lmI <- localmoran(tn$Stud_Pop_under_20, dnb24.list, 
                  na.action = na.exclude)
lmI = data.frame(lmI) 
rownames(lmI) = tn$Comune
lmI_sign = lmI%>%
  filter(lmI$Pr.z....E.Ii..<0.1) %>%
  arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through local Moran's
sum(c(tn$Comune[quadrants.knn != "None" | 
                                 quadrants.dnb != "None" | 
                                 quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.2352941

17 municipalities find a significant statistic for the local spatial autocorrelation and just \(24\%\) are in common with the ones found in the Moran scatterplots’ quadrants.

The following two plots show:

  • the territories where the Local Moran’s values are higher (darker colours) and lower (lighter colours). As usual, the white areas represent missing values municipalities, while labels are indicate NOT for the zones with statistical significance for Local Moran’s I, but for those who show highest/lowest values for \(I_i\) (let’s say above \(0.2\) and below \(-0.2\)). In this way, territories like Tione di Trento, Ossana, Cles and Rovereto are highlighted as in the Moran’s scatterplot.
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
     border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(lmI$Ii>=0.2 | lmI$Ii<=-0.2, tn@data$Comune ,""), cex=0.7)

  • however, looking at the p-value, we can notice how it is extremely high for all observations and therefore that no local spatial autocorrelation can be confirmed for the municipalities found in the previous plot, just for Molveno, Storo, Cimone and Aldeno. Given the absence of statistical significance, we can discard the hypothesis of students overcrowding in specific areas.
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
                 ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
                        ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
                               ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5])))) 
tn$colpval[pval>0.05] <- "white" 
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9) 
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"

plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant", 
       "p-value = 0.05", "p-value = 0.01", "p-value = 0.001", 
       "p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.6)

Spatial Regression Models

Population under 20 over the total population

Let’s start with the SDM model. The lagsarlm()9 function provides Maximum likelihood estimation of spatial simultaneous autoregressive lag and spatial Durbin (mixed) models of the form:

\[ Y = \rho Wy+X\beta+\epsilon \]

As shown in the summary, only the mean of students per school is a relevant predictor, none of the lag or the remaining predictors’ p-value is close to the \(0.05\) threshold to demonstrate statistical significance. Also, the overall p-value is above \(0.5\), therefore no significance is shown by the SDM.

library(spatialreg)

# Estimate the SDM model using the Maximum likelihood estimator
SDM <- lagsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                type="mixed", na.action = na.ignore)
summary(SDM)
## 
## Call:lagsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore, type = "mixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0822634 -0.0120025  0.0012275  0.0132263  0.0510890 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##                          Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)            2.0941e-01  4.2818e-02  4.8908 1.004e-06
## Stud_Pop_under_20     -3.5086e-03  3.7552e-03 -0.9343 0.3501250
## Scuole                 1.0641e-03  1.2364e-03  0.8606 0.3894323
## Media_stud_scuola      1.3926e-04  3.6987e-05  3.7652 0.0001664
## Studenti              -7.4425e-06  7.2944e-06 -1.0203 0.3075849
## lag.Stud_Pop_under_20 -1.2982e-02  1.6821e-02 -0.7718 0.4402583
## lag.Scuole            -3.3139e-03  4.6572e-03 -0.7116 0.4767314
## lag.Media_stud_scuola  8.7287e-05  1.1580e-04  0.7538 0.4509847
## lag.Studenti           2.6229e-05  2.8213e-05  0.9297 0.3525434
## 
## Rho: -0.14384, LR test value: 0.44322, p-value: 0.50557
## Asymptotic standard error: 0.23123
##     z-value: -0.62207, p-value: 0.5339
## Wald statistic: 0.38697, p-value: 0.5339
## 
## Log likelihood: 409.4587 for mixed model
## ML residual variance (sigma squared): 0.00042124, (sigma: 0.020524)
## Number of observations: 166 
## Number of parameters estimated: 11 
## AIC: -796.92, (AIC for lm: -798.47)
## LM test for residual autocorrelation
## test value: 0.0993, p-value: 0.75267

Same principle applies to SAR, which has an higher p-value than SDM, indicating the absence of statistical significance. The relevance of predictors remains unaltered.

# Estimate the SAR model using the Maximum likelihood estimator
SAR <- lagsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SAR)
## 
## Call:lagsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0851982 -0.0122352  0.0021982  0.0137631  0.0547369 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                      Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)        1.7100e-01  3.6728e-02  4.6558 3.228e-06
## Stud_Pop_under_20 -4.3445e-03  3.6708e-03 -1.1835    0.2366
## Scuole             1.6804e-03  1.2191e-03  1.3784    0.1681
## Media_stud_scuola  1.5913e-04  3.4866e-05  4.5642 5.014e-06
## Studenti          -1.1361e-05  7.1638e-06 -1.5858    0.1128
## 
## Rho: 0.050613, LR test value: 0.078455, p-value: 0.7794
## Asymptotic standard error: 0.18948
##     z-value: 0.26712, p-value: 0.78937
## Wald statistic: 0.071355, p-value: 0.78937
## 
## Log likelihood: 407.2789 for lag model
## ML residual variance (sigma squared): 0.0004329, (sigma: 0.020806)
## Number of observations: 166 
## Number of parameters estimated: 7 
## AIC: -800.56, (AIC for lm: -802.48)
## LM test for residual autocorrelation
## test value: 1.281, p-value: 0.25772

Now we skip to the local spillover specifications, starting from SDEM, which shows no significance in terms of p-value, both of the model and about the predictors.

# Estimate the SDEM model using the Maximum likelihood estimator
SDEM <- errorsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                   etype = "emixed", na.action = na.ignore)
summary(SDEM)
## 
## Call:errorsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore, etype = "emixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0823709 -0.0118217  0.0011253  0.0132666  0.0510749 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                          Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)            1.8366e-01  9.5850e-03 19.1616 < 2.2e-16
## Stud_Pop_under_20     -3.4720e-03  3.7524e-03 -0.9253 0.3548242
## Scuole                 1.1078e-03  1.2315e-03  0.8995 0.3683669
## Media_stud_scuola      1.3904e-04  3.7117e-05  3.7459 0.0001797
## Studenti              -7.7213e-06  7.2580e-06 -1.0638 0.2874074
## lag.Stud_Pop_under_20 -1.1487e-02  1.5701e-02 -0.7316 0.4644119
## lag.Scuole            -3.3237e-03  4.3966e-03 -0.7560 0.4496691
## lag.Media_stud_scuola  5.3972e-05  1.0057e-04  0.5366 0.5915187
## lag.Studenti           2.6360e-05  2.6723e-05  0.9864 0.3239349
## 
## Lambda: -0.14789, LR test value: 0.44493, p-value: 0.50475
## Asymptotic standard error: 0.2341
##     z-value: -0.63174, p-value: 0.52756
## Wald statistic: 0.3991, p-value: 0.52756
## 
## Log likelihood: 409.4596 for error model
## ML residual variance (sigma squared): 0.0004212, (sigma: 0.020523)
## Number of observations: 166 
## Number of parameters estimated: 11 
## AIC: -796.92, (AIC for lm: -798.47)

The same happens for the SEM.

# Estimate the SEM model using the Maximum likelihood estimator
SEM <- errorsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SEM)
## 
## Call:errorsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0849250 -0.0123802  0.0025571  0.0132241  0.0550477 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                      Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)        1.8066e-01  3.6292e-03 49.7798 < 2.2e-16
## Stud_Pop_under_20 -4.5948e-03  3.6494e-03 -1.2590    0.2080
## Scuole             1.6581e-03  1.2192e-03  1.3600    0.1738
## Media_stud_scuola  1.6298e-04  3.4221e-05  4.7625 1.912e-06
## Studenti          -1.1209e-05  7.1655e-06 -1.5643    0.1177
## 
## Lambda: -0.043474, LR test value: 0.040879, p-value: 0.83977
## Asymptotic standard error: 0.22263
##     z-value: -0.19528, p-value: 0.84518
## Wald statistic: 0.038133, p-value: 0.84518
## 
## Log likelihood: 407.2601 for error model
## ML residual variance (sigma squared): 0.00043301, (sigma: 0.020809)
## Number of observations: 166 
## Number of parameters estimated: 7 
## AIC: -800.52, (AIC for lm: -802.48)

On the other hand, the LDM shows the same relevance for predictors, but a low high-value, demonstrating the significance of this estimation.

# Estimate the LDM model using the OLS estimator
LDM <- lmSLX(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti,tn, listw=dnb16.list, na.action = na.ignore)
summary(LDM)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.082928 -0.011458  0.001065  0.013532  0.051027 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.832e-01  1.060e-02  17.286  < 2e-16 ***
## Stud_Pop_under_20     -3.402e-03  3.868e-03  -0.879 0.380477    
## Scuole                 1.150e-03  1.272e-03   0.904 0.367383    
## Media_stud_scuola      1.383e-04  3.810e-05   3.630 0.000383 ***
## Studenti              -7.986e-06  7.500e-06  -1.065 0.288594    
## lag.Stud_Pop_under_20 -1.118e-02  1.716e-02  -0.651 0.515795    
## lag.Scuole            -3.319e-03  4.798e-03  -0.692 0.490169    
## lag.Media_stud_scuola  6.130e-05  1.078e-04   0.568 0.570514    
## lag.Studenti           2.585e-05  2.907e-05   0.889 0.375166    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02115 on 157 degrees of freedom
## Multiple R-squared:  0.1814, Adjusted R-squared:  0.1396 
## F-statistic: 4.348 on 8 and 157 DF,  p-value: 9.297e-05

The calculation of impacts for spatial lag and spatial Durbin models is needed in order to interpret the regression coefficients correctly, because of the spillovers between the terms in these data generation processes[spdep Documentation, https://www.rdocumentation.org/packages/spdep/versions/1.0-2/topics/impacts][LeSage J and RK Pace (2009) Introduction to Spatial Econometrics. CRC Press, Boca Raton, pp. 33–42, 114–115; LeSage J and MM Fischer (2008) Spatial growth regressions: model specification, estimation and interpretation. Spatial Economic Analysis 3 (3), pp. 275–304]10. The method impacts returns the average direct, indirect and total impacts for the variables in the model, for the variables themselves in the spatial lag model case, for the variables and their spatial lags in the spatial Durbin (mixed) model case.

The Lagrange multiplier (LM) test of spatial dependence on OLS residuals. In the LM test the alternative hypothesis is explicitly considered to contrast the null #of absence of spatial dependence. In particular, we can explicitly express the alternative hypothesis either in the form of a SL or of a SEM.

The function lm.LMtests() reports the estimates of tests chosen among 4 statistics for testing for spatial dependence in linear models, which are:

  • LMerr: LM test for error dependence;
  • LMlag, simple LM test for a missing spatially lagged dependent variable;
  • their robust version called RLMerr and RLMlag.

According to this diagnostic, no model has been found significant by looking at the p-value.

OLSmodel <- lm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
natOLSlmTests <- lm.LMtests(OLSmodel, dnb16.list, 
                    test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
summary(natOLSlmTests)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
## Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
## weights: dnb16.list
##  
##        statistic parameter p.value
## LMerr   0.042152         1  0.8373
## LMlag   0.090537         1  0.7635
## RLMerr  1.141962         1  0.2852
## RLMlag  1.190346         1  0.2753

In conclusion, a positive significant direct impact on the proportion of population under 20 over the total population has been found through the mean number of students per school, implying that higher levels of potential students in the territory lead to the increase of mean students per school in the same neighbourhood.

Mean of students per class

Again, we can repeat the same analysis with the mean of students per class, using the same predictors used in the OLS residuals section (i.e. optimal model).

Starting from SDM, all predictors seem to gain high relevance given the low p-value, while their lagged values are relevant only for the mean students per school and slightly for the number of classes. The overall p-value of the model is below \(0.05\) and therefore considered statistically significant.

SDM <- lagsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                type="mixed", na.action = na.ignore)
summary(SDM)
## 
## Call:lagsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore, type = "mixed")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11.17522  -1.30828   0.03293   1.10682   7.30820 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##                         Estimate Std. Error z value  Pr(>|z|)
## (Intercept)            8.3716051  1.8391923  4.5518 5.319e-06
## Stud_Pop_under_20      7.0792138  0.4082789 17.3392 < 2.2e-16
## Scuole                 0.8148910  0.1296859  6.2836 3.309e-10
## Classi                -0.3689308  0.0487037 -7.5750 3.597e-14
## Media_stud_scuola      0.0392220  0.0037829 10.3683 < 2.2e-16
## Studenti               0.0138605  0.0024529  5.6507 1.598e-08
## lag.Stud_Pop_under_20  3.2655810  2.4964221  1.3081   0.19084
## lag.Scuole             1.0207311  0.5441222  1.8759   0.06067
## lag.Classi            -0.5003442  0.2590131 -1.9317   0.05339
## lag.Media_stud_scuola  0.0408093  0.0161537  2.5263   0.01153
## lag.Studenti           0.0200382  0.0125349  1.5986   0.10991
## 
## Rho: -0.55537, LR test value: 4.4195, p-value: 0.035531
## Asymptotic standard error: 0.26076
##     z-value: -2.1298, p-value: 0.033185
## Wald statistic: 4.5362, p-value: 0.033185
## 
## Log likelihood: -358.8017 for mixed model
## ML residual variance (sigma squared): 4.3451, (sigma: 2.0845)
## Number of observations: 166 
## Number of parameters estimated: 13 
## AIC: 743.6, (AIC for lm: 746.02)
## LM test for residual autocorrelation
## test value: 5.8818, p-value: 0.015298

In SAR case instead, despite all predictors seem to have high relevance, the spatial autoregressive parameter \(\rho\) (Rho) is not significant due to high p-value on both the asymptotic t-test and Likelihood ratio test, but the LM test for residual autocorrelation has low p-value, indicating the presence of spatial autocorrelation in the residuals.

SAR <- lagsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SAR)
## 
## Call:lagsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11.61926  -1.11213  -0.15396   1.17342   7.59740 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                     Estimate Std. Error z value  Pr(>|z|)
## (Intercept)        5.4160302  0.9076190  5.9673 2.412e-09
## Stud_Pop_under_20  7.0058916  0.4183042 16.7483 < 2.2e-16
## Scuole             0.8438018  0.1303428  6.4737 9.562e-11
## Classi            -0.3709965  0.0493342 -7.5201 5.485e-14
## Media_stud_scuola  0.0406608  0.0037524 10.8360 < 2.2e-16
## Studenti           0.0137928  0.0024817  5.5578 2.733e-08
## 
## Rho: 0.09763, LR test value: 1.81, p-value: 0.17851
## Asymptotic standard error: 0.076863
##     z-value: 1.2702, p-value: 0.20402
## Wald statistic: 1.6133, p-value: 0.20402
## 
## Log likelihood: -363.5075 for lag model
## ML residual variance (sigma squared): 4.6699, (sigma: 2.161)
## Number of observations: 166 
## Number of parameters estimated: 8 
## AIC: 743.02, (AIC for lm: 742.83)
## LM test for residual autocorrelation
## test value: 3.9831, p-value: 0.04596

In case of local spillover specifications, all predictors are relevant, but not their lagged version. The \(\lambda\) (Lambda) estimator is highly significant due to low p-value on both Likelihood ratio and asymptotic t-tests.

SDEM <- errorsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                   etype = "emixed", na.action = na.ignore)
summary(SDEM)
## 
## Call:errorsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + 
##     Scuole + Classi + Media_stud_scuola + Studenti, data = tn, 
##     listw = dnb16.list, na.action = na.ignore, etype = "emixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -10.835976  -1.281293  -0.071948   1.124115   7.072121 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                         Estimate Std. Error z value  Pr(>|z|)
## (Intercept)            5.1633884  0.7473589  6.9088 4.886e-12
## Stud_Pop_under_20      7.0991172  0.4126814 17.2024 < 2.2e-16
## Scuole                 0.8014391  0.1303753  6.1472 7.888e-10
## Classi                -0.3631380  0.0482991 -7.5185 5.529e-14
## Media_stud_scuola      0.0379363  0.0038330  9.8973 < 2.2e-16
## Studenti               0.0136307  0.0024354  5.5970 2.181e-08
## lag.Stud_Pop_under_20 -0.3096908  1.2769796 -0.2425   0.80838
## lag.Scuole             0.4894061  0.3759430  1.3018   0.19298
## lag.Classi            -0.3029760  0.2105152 -1.4392   0.15009
## lag.Media_stud_scuola  0.0159678  0.0095859  1.6658   0.09576
## lag.Studenti           0.0130679  0.0103491  1.2627   0.20669
## 
## Lambda: -0.80016, LR test value: 7.2953, p-value: 0.0069134
## Asymptotic standard error: 0.27703
##     z-value: -2.8884, p-value: 0.0038722
## Wald statistic: 8.3428, p-value: 0.0038722
## 
## Log likelihood: -357.3637 for error model
## ML residual variance (sigma squared): 4.2043, (sigma: 2.0504)
## Number of observations: 166 
## Number of parameters estimated: 13 
## AIC: 740.73, (AIC for lm: 746.02)

In case of SEM instead, the estimator is not statistically significant in both tests due to p-value around \(0.3\).

SEM <- errorsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SEM)
## 
## Call:errorsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + 
##     Scuole + Classi + Media_stud_scuola + Studenti, data = tn, 
##     listw = dnb16.list, na.action = na.ignore)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -11.372522  -1.197946  -0.013962   1.361034   7.809514 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                     Estimate Std. Error z value  Pr(>|z|)
## (Intercept)        6.3045147  0.3614975 17.4400 < 2.2e-16
## Stud_Pop_under_20  6.8364576  0.4094121 16.6982 < 2.2e-16
## Scuole             0.8768414  0.1302784  6.7305 1.691e-11
## Classi            -0.3783119  0.0494259 -7.6541 1.954e-14
## Media_stud_scuola  0.0435938  0.0034738 12.5491 < 2.2e-16
## Studenti           0.0139824  0.0024918  5.6113 2.008e-08
## 
## Lambda: -0.25333, LR test value: 0.96423, p-value: 0.32612
## Asymptotic standard error: 0.24437
##     z-value: -1.0366, p-value: 0.2999
## Wald statistic: 1.0746, p-value: 0.2999
## 
## Log likelihood: -363.9304 for error model
## ML residual variance (sigma squared): 4.6797, (sigma: 2.1633)
## Number of observations: 166 
## Number of parameters estimated: 8 
## AIC: 743.86, (AIC for lm: 742.83)

Also for LDM predictors are relevant, except for their lagged values. The p-value is below the \(0.05\) threshold, concluding the significance of this model.

LDM <- lmSLX(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti,tn, listw=dnb16.list, na.action = na.ignore)
summary(LDM)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4798  -1.2584  -0.0412   1.0929   7.3924 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.290053   1.127564   4.692 5.91e-06 ***
## Stud_Pop_under_20      7.066653   0.431613  16.373  < 2e-16 ***
## Scuole                 0.803002   0.137004   5.861 2.67e-08 ***
## Classi                -0.372470   0.051483  -7.235 2.02e-11 ***
## Media_stud_scuola      0.039215   0.003993   9.821  < 2e-16 ***
## Studenti               0.014097   0.002593   5.436 2.07e-07 ***
## lag.Stud_Pop_under_20 -0.724507   1.824028  -0.397    0.692    
## lag.Scuole             0.492397   0.512888   0.960    0.339    
## lag.Classi            -0.293354   0.251452  -1.167    0.245    
## lag.Media_stud_scuola  0.014574   0.011327   1.287    0.200    
## lag.Studenti           0.012448   0.012624   0.986    0.326    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.204 on 155 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9032 
## F-statistic:   155 on 10 and 155 DF,  p-value: < 2.2e-16

All predictors’ impacts are significant, if looking at direct and total impacts. Results show particular significance for:

  • the number of students over the population under 20: on agerage, because of spatial spillovers, a change in a neighbourhood would lead to a cumulative increase of \(7.76\) students per class;
  • the number of classes: on average, because of spatial spillovers, a change in a neighborhood would also lead to a cumulative decrease of \(0.411\) (half student per class);
  • number of schools: on average, because of spatial spillovers, a change in a neighborhood would also lead to a cumulative increase of \(0.935\) (nearly one student per class).

Also, a cumulative increase will happen by changing neighbourhoods if considering an increase in terms of mean of students per school and number of students, with lower impact than the previous three mentioned predictors.

impSAR <- impacts(SAR, listw=dnb16.list, R=100)
summary(impSAR, zstats=TRUE, short=TRUE)
## Impact measures (lag, exact):
##                        Direct     Indirect       Total
## Stud_Pop_under_20  7.01020865  0.753670577  7.76387923
## Scuole             0.84432172  0.090773394  0.93509511
## Classi            -0.37122509 -0.039910570 -0.41113566
## Media_stud_scuola  0.04068585  0.004374154  0.04506001
## Studenti           0.01380133  0.001483787  0.01528511
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                        Direct    Indirect       Total
## Stud_Pop_under_20 0.452330015 0.691290352 0.923475851
## Scuole            0.128019566 0.080993591 0.151433385
## Classi            0.047550669 0.034698936 0.058281087
## Media_stud_scuola 0.004460688 0.003931098 0.005611523
## Studenti          0.002415223 0.001294252 0.002824999
## 
## Simulated z-values:
##                      Direct  Indirect     Total
## Stud_Pop_under_20 15.457710  1.131066  8.418067
## Scuole             6.598456  1.124764  6.179814
## Classi            -7.800642 -1.158092 -7.053923
## Media_stud_scuola  9.205845  1.128736  8.108596
## Studenti           5.700484  1.153396  5.402028
## 
## Simulated p-values:
##                   Direct     Indirect Total     
## Stud_Pop_under_20 < 2.22e-16 0.25803  < 2.22e-16
## Scuole            4.1546e-11 0.26069  6.4177e-10
## Classi            6.2172e-15 0.24683  1.7395e-12
## Media_stud_scuola < 2.22e-16 0.25901  4.4409e-16
## Studenti          1.1947e-08 0.24875  6.5892e-08

In this second case, the p-value is higher than before, but still implying significance mainly for students over population under \(20\) (increase), number of schools (increase) and number of classes (decrease).

#For the SD specification:
impSDM <- impacts(SDM, listw=dnb16.list, R=100)
summary(impSDM, zstats=TRUE, short=TRUE)
## Impact measures (mixed, exact):
##                        Direct    Indirect       Total
## Stud_Pop_under_20  7.09719669 -0.44619729  6.65099940
## Scuole             0.79955069  0.38062945  1.18018014
## Classi            -0.36095370 -0.19793100 -0.55888471
## Media_stud_scuola  0.03870830  0.01274642  0.05145472
## Studenti           0.01352731  0.00826724  0.02179455
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                        Direct    Indirect       Total
## Stud_Pop_under_20 0.420075417 1.053428744 1.018988807
## Scuole            0.137524385 0.322052400 0.326114072
## Classi            0.047700732 0.171224404 0.165702318
## Media_stud_scuola 0.003614372 0.007334518 0.007246858
## Studenti          0.002403222 0.008611326 0.008526823
## 
## Simulated z-values:
##                      Direct   Indirect     Total
## Stud_Pop_under_20 16.850589 -0.5012483  6.428421
## Scuole             5.825831  1.3514950  3.791453
## Classi            -7.599245 -1.1080040 -3.332524
## Media_stud_scuola 10.714155  1.8933487  7.259937
## Studenti           5.655192  0.8653011  2.467750
## 
## Simulated p-values:
##                   Direct     Indirect Total     
## Stud_Pop_under_20 < 2.22e-16 0.616196 1.2894e-10
## Scuole            5.6829e-09 0.176537 0.00014977
## Classi            2.9754e-14 0.267860 0.00086062
## Media_stud_scuola < 2.22e-16 0.058312 3.8725e-13
## Studenti          1.5567e-08 0.386874 0.01359651

Focusing on the Lagrange multiplier diagnostics for spatial dependence, let’s try focusing on the same model but reducing the number of predictors to those with higher impacts on both SEM and SDM models.

According to the summary, all statistics are significant, except for Robust Linear Model test for error dependence (RLMerr).

OLSmodel <- lm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi, data = tn, na.action = na.ignore)
natOLSlmTests <- lm.LMtests(OLSmodel, dnb16.list, 
                    test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
summary(natOLSlmTests)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi, data = tn, na.action = na.ignore)
## weights: dnb16.list
##  
##        statistic parameter   p.value    
## LMerr   12.82227         1 0.0003425 ***
## LMlag   23.97668         1 9.751e-07 ***
## RLMerr   0.38852         1 0.5330771    
## RLMlag  11.54293         1 0.0006801 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The selection strategy of the above mentioned models proposed by Elhorst (2010) is to:

  1. Estimate the OLS model and test (with the LM test) whether the SL or the SEM is more appropriate to describe the data;

  2. If the OLS model is rejected in favour of the SL, the SEM or in favour of both models, then the SDM should be estimated;

  3. Likelihood ratio (LR) tests can subsequently be used to test whether:

    • the SDM can be simplified to the SLM,
    • whether it can be simplified to the SEM.

If both hypotheses are rejected, then the SDM best describes the data. If the hypothesis i) cannot be rejected, then the SLM best describes the data, provided that the (robust) LM tests also pointed to the SLM

If the hypothesis ii) cannot be rejected, then the SEM best describes the data, provided that the (robust) LM tests also pointed to the SEM. We can perform LR tests of restrictions on the parameters of spatial models using the function anova().

# Test hypothesis i)
anova(SAR, SDM)
##     Model df    AIC  logLik Test L.Ratio  p-value
## SAR     1  8 743.02 -363.51    1                 
## SDM     2 13 743.60 -358.80    2  9.4117 0.093727
# Test hypothesis ii)
anova(SDM, SEM)
##     Model df    AIC  logLik Test L.Ratio  p-value
## SDM     1 13 743.60 -358.80    1                 
## SEM     2  8 743.86 -363.93    2  10.257 0.068261
# 
anova(SDEM, SEM)
##      Model df    AIC  logLik Test L.Ratio  p-value
## SDEM     1 13 740.73 -357.36    1                 
## SEM      2  8 743.86 -363.93    2  13.133 0.022161

  1. Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (2): 234–40. http://www.geog.ucsb.edu/~tobler/publications/pdf_docs/A-Computer-Movie.pdf.↩︎

  2. Wikipedia page of Centroid, https://en.wikipedia.org/wiki/Centroid↩︎

  3. How to find the centre of a polygon i Python, https://deparkes.co.uk/2015/02/28/how-to-find-the-centre-of-a-polygon-in-python/↩︎

  4. saveGIF() documentation https://www.rdocumentation.org/packages/animation/versions/2.4.1/topics/saveGIF↩︎

  5. Spatial regression models documentation, https://rspatial.org/raster/analysis/7-spregression.html↩︎

  6. Spatial Autocorrelation, https://ibis.geog.ubc.ca/courses/geob479/notes/spatial_analysis/spatial_autocorrelation.htm#↩︎

  7. LeSage, J., and R.K. Pace. 2009. Introduction to Spatial Econometrics. Statistics: A Series of Textbooks and Monographs. CRC Press↩︎

  8. LeSage, J. 2014. “What Regional Scientists Need to Know About Spatial Econometrics.” Review of Regional Studies 44 (1): 13–32↩︎

  9. https://cran.r-project.org/web/packages/spatialreg/spatialreg.pdf↩︎

  10. Roger Bivand, Gianfranco Piras (2015). Comparing Implementations of Estimation Methods for Spatial Econometrics. Journal of Statistical Software, 63(18), 1-36. https://www.jstatsoft.org/v63/i18/↩︎